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Abstract 

We address the issue of recovering the structure of large sparse directed acyclic 
graphs from noisy observations of the system. We propose a novel procedure based on 
a specific formulation of the U-norm regularized maximum likelihood, which decom¬ 
poses the graph estimation into two optimization sub-problems: topological structure 
and node order learning. We provide oracle inequalities for the graph estimator, as 
well as an algorithm to solve the induced optimization problem, in the form of a 
convex program embedded in a genetic algorithm. We apply our method to various 
data sets (including data from the DREAM4 challenge) and show that it compares 
favorably to state-of-the-art methods. 
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1 Introduction 


Revealing the true structure of a complex system is paramount in many fields to identify 
system regulators, predict i ts behavior or decide where interventions are needed to disen¬ 


tangle direct relationships (Newman), 2003 : Souma et, ah . 2006; Verma et al. . 2014J ). This 


problem can often be seen as a graph inference problem. Given observational data, we aim 
at predicting the presence (or absence) of edges between elements of the system, which 
form the vertices of a graph. Edges specify the relational structure of the system depicted 
as a graph or network. As a motivating problem, the reconstruction of Gene Regulatory 
Networks (GRN), which model activation and inhibition relationships betwee n genes, is 
one of the main challenges in modern computational biology ( Barabasi fc Oltval 20041 ). 

A popular approach consists in assuming that the data are generated by a Directed 
Acyclic Graph (DAG) ( Pearl . 20091 ). DAGs are made of a collection of vertices, which 
stand for variables, and directed edges to model the dependency structure among the vari¬ 
ables, avoiding self-loops and cycles. However, inferring a DAG is a rather challenging 
problem. Firstly, the number of nodes p of the graph may be so large that exploring 


tures is super-exponential in p (' 

dobinson, 

1973- 

Koivisto & Sood, 

2004 

Tsamardinos et all 

2006; 

Grzegorczvk & Husmeier 

2008 

). Another dimension flaw occurs when p , even being 


reasonable, is larger than the number of observations, and parameter estimation is jeop¬ 
ardized. High-dimensional statistical techniqu es are then needed to overcome this issue 
f Bii him arm &: van de Geer . 2011 : Giraud, 20151) . Secondly, even if the ratio between p and 
the sample size n is not imped i ng model estimation, the nature of the data can be an addi¬ 
tional obstacle ( Ellis fe WoneJ. 2008 : Guvon et al. . 2010l: Fu fe Zhoul. 20131) . The available 
observational data are in general not sufficient to identify the true underlying DAG, and 
can only determine an equivalence class of DAGs ( Verma fe Pearl . 199lt) . This approach 
relies on the assumption that the j oint distribution is Markov and faithful with respect to 
the true graph ( Spirtes et all 2000 ). 

A large number of methods have been pr oposed for estimating DAGs, including for 
instance score-ba sed methods ( Bayesian score, iFriedman &; Kollerl 120031 or Bayesian Infor¬ 
mation Criterion. ISchwarzlll978f) . complex space sampling (jZhoul.l201ll) or the PC algorithm 


(ISpirtes et ah . 2000 ). The latter has been proved to be uniformly consistant in the high¬ 


dimensional case, but requir es a test of condition al independences that quickly becomes 
computationally intractable ( Kalisch V Biihlma/nn . 20071) . 

In this work, we focus on Gaussian structural equation models ( Wright . 192ll) associated 
with maximum likelihood estimators (MLE). In the last years, the -^-regularization of the 
MLE drew the attention of a large number of works since it leads to infer sparse graphs. In 
DAGs, a not necessarily topological ordering of the nodes can always be defined according to 
edge distribution (IKahnl. If 9621) . Identifying this ordering is know n to be a challenging prob¬ 
l em (ICookl. Il985l) . When the order of the variables is unknown, Ivan de Geer fe Buhlmann 
(120131) studi ed the convergence of the In-penalized likelihood. Howeve r, the ^-regularized 
approaches (ISilander &; Mvllvmakil . 12006c lHauser fc Biihlmamil . 120121) remain impractical 
for estimating graphs with more than 20 vertices, either due to an exhaustive exploration 
of the set of DAGs or overfitting (IChen fe Chenl . 20081 ). For a known order among the 
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variables in the graph, ISho i aie fe Michailid isl (20101) present results for the estimation of 
high-dimensional graphs based on independent linear regressions using an adaptive Lasso 
scheme. 

Onr objective consists in overcoming this drastic dimensional limitation, and find infer¬ 
ence strategies for graphs with np to several hundred nodes. Such strategies must ensure 
a high level of sparsity, be supported by computationally affordable algorithms, while pre¬ 
serving sound the oreti cal bases. Here, we propose t o use the fj-regularization, similarly to 


Fu fe Zhoul (120131 1 and IShoiaie fe Michailidisl (120101 1 . to penalize the MLE. From a compu¬ 


tational point of view, this regularization makes the criterion to maximize partially convex 
while ensuring sparse estimates. Our contribution is two-fold: firstly, we provide oracle 
inequalities that guarantee good theoretical performances of our proposed estimator in the 
sparse high-dimensional setting. Secondly, we provide an efficient algorithm to infer the 
true unknown DAG, in the form of a convex program embedded in a genetic algorithm. 

The next section covers the model definition and the associated penalized MLE problem. 
Section [3] details the oracle inequalities, and Section [4] our inference algorithm. Section [5] 
reports numerical experiments both on toy problems and realistic data sets. 


2 The ^-penalized likelihood for estimating DAGs 


2.1 DAG’s modelling and estimation 

This work considers the framework of an unknown DAG Qq = (V, E ), consisting of vertices 
V = {l,...,p} and a set of edges E C V X V. The y nodes are associated to random 
variables X 1 , ...,X P . A natural approach, developped by Meinshausen fc Biihlmannl ( 20061) 
to solve the network inference problem is to consider that each variable A"* (1 < i < p ) of 
the DAG can be represented as a linear function of all other variables A"- 7 (j ^ i ) through 
the Gaussian Structural Equation Model: 


p 


VjG[l,p], X J = ^{G 0 )lX l + £^ 

i —1 


(i) 


with e- 7 ~ jV(0, <7j) (a? known) a Gaussian residual error term. The set of edges E corre¬ 
sponds to the non-zero coefficients of G o, i.e. {Go)j encoding the relationship from variable 
X 7 to variable XL 

Assume that we observe an n-sample consisting of n i.i.d. realizations (X 1 ,...,X P ) of 
Equation ([T)), distributed according to a A/”(0, E) law where £ is non-singular. We denote 
by X := (A" 1 , ...,X P ) the n x p data matrix. The relations between the variables can then 
be represented in its matrix form: 


X = XG 0 + £, (2) 

where G 0 = ((G 0 )i)i<ij< p is the p x p matrix compatible with the graph Q 0 and e := 
(e 1 ,..., e p ) is the n x p matrix of noise vectors. 
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( 3 ) 


The negative log-likelihood of the model is then ( Rau et all [2013): 

p n p 1 


«(G) = f logM+n^T log (Tj + EE -(x k (i-Gyy. 

3 =1 


07 

fc=i i=i •? 


To recover the structure of the DAG (p and make the estim ated graph sparse enough, 
we focus on a penalized maximum likelihood procedure (IBickel fc Lil . 20 06): 


G = argmin 

GgGdag 


(£(G) + Apen(G)}, 


(4) 


where £(.) is the negative log-likelihood of Equation Q, pen(.) is a penalization function, 
A is a trade-off parameter between penalization and fit to the data, and Gdag is the set of 
p x p matrices compatible with a DAG over p nodes. 


In the setting of Gaussian structural equation models with equal noise variance, Peters et ah 


(20111 2014) showed that the true DAG was identifiable for respectively discrete and con¬ 
tinuous data. In a nutshell, it implies that the true DAG could be inferred, not just 
the Markov equivalence class of the underlying DAG - a partially directed graph exactly 
encoding the conditional dependency structure. Using an pj-norm regularization in Equa¬ 
tion g)) is an attractive option to infer sparse graphs. From a computational point of 
view, the main difficulty when solving the optimization problem in Equation (J3J) lies in 
exploring the set of DAGs Gdag ■ ( Chickerina . 1996 1 showed it to be an NP-hard prob¬ 
lem: an p-regularization does not set a favorable framework for this task. To avoid 
the whole exploration of Gdag-, a dynamic programming method has been proposed in 
Silander &; Mvllvmakil ( 2006 1. using a particular decomposition of th e In-p enalized maxi¬ 
mum likelihood. The gre edy equivalent search algorit hm of lChickerind (120021) is a hill climb¬ 
ing alternative method. ( Hauser fc Biihlma/nn . 2012h rather restricted the search space to 
the smaller space of equivalence classes, and they provide an efficient algorithm without 
enumerating all the equivalent DAGs. They showed that they are asymptotically optimal 
under a faithfulness assumption (i.e. independences in the distribution are those read from 
Go). However, none of the approaches above can be used on high-dimensional data to esti¬ 
mate graphs with a large number of nodes. In this context, we focus on the p-norrn convex 
regularization instead of £q for its sparse, high-dimensional and computational properties. 

The £ i-regularization clearly improves the computation of (JU). It allows us to write 
a convex formulation of the problem (see Section 12T2]) . Given Equation ((2]) and omitting 
constant terms, the p-penalized likelihood estimator we consider is: 


G = argmin 
G&Gdag 


i||X(/-G)||J. + A||G|| 1 


(5) 


2 

where, for any matrix M := {Mi)i<i,j< P , we denote by ||M|| F = JA . (M/) the Frobenius 
norm and by ||iW|| x = JA . |Mf | the p-norm. 


2.2 A new formulation for the estimator 

We propose here a new formulation of the minimization problem of Equation (l5l) . It 
naturally decouples the initial problem into two steps of the minimisation procedure: node 
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ordering and graph topology search. A key property is that any DAG leads to a topological 
ordering of its vertices, denoted <, where a directed pa th from node A* to node X J is 
equivalent to X J < X 1 f Kahril . 1962 : Cormen et all 2001 1 (see Example [1] below for more 
explan ations). This ordering is not unique in general. Proposition 12.11 from Biihlmann 
(120131 ) then gives an equivalent condition for a matrix to be compatible with a DAG. 


Proposition 2.1 (Biihlmann, 2013) A matrix G is compatible with a DAG Q if and 
only if there exists a permutation matrix P and a strictly lower triangular matrix T such 
that: 

G = PTP t . 

Graphically, the permutation matrix sets an ordering of the nodes of the graph and is 
associated to a complete graph. The strictly lower triangular matrix T sets the graph 
structure, i.e. the non-zero entries of G, as illustrated in Example [0 

Example 1 Consider the DAG Q given in Figure[T\ (left). The corresponding matrix G 
can then be written as the strictly lower-triangular matrix T by permutation of its rows and 
columns using P: 
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with T = 


Looking at the non-zero values of P column by column, P defines a node hierarchy X 5 < 
X 3 < X 4 < X 1 < X 2 compatible with the topological orderings of Q. Graphically, P is 
associated to the complete graph represented in Figure\T\ (right). The dashed edges then 
correspond to the lower zero entries ofT. Note that since A" 3 is not connected with X 1 , X 4 
and X 5 , four topological ordering are possible (X 5 < A 4 < A" 1 < A 3 , A" 5 < A 4 < A" 3 < 
A 1 , A 5 < A 3 < A 4 < A 1 and A 3 < A 5 < A 4 < A 1 ;. 

Using Proposition 12.11 the estimator in ([2]) leads to the following equivalent optimization 
problem: 

(P,T) = argmin ( - ||A (I - PTP t ) \\ 2 p + A \\Tl\A , (6) 

(p,T)ec in" J 

where the optimization space of constraints C is defined as C = P p (R) x T p (R), with P ? 


the set of permutation matrices and T p (R) the set of strictly lo wer-t ri angular matri ces. 
Note that a similar formulation has already been proposed by Ivan de Geer fe Biihlmann 
(120131 ) to ensure good theoretical properties for the ^ 0 -penaliz:ed log-likelihood estimation. 
However, it has never been exploited from a computational point of view to recover the 
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Figure 1: An example of DAG Q (top) and the action of P and T on Q: P is associated to 
a complete graph that orders the nodes of the graph (bottom) and T sets the weights on 
the edges. The dashed edges correspond to null weight edges (a zero entry in T ). 


graph structure optimizing problem (15j) . In the following two sections, we propose a the¬ 
oretical analysis of the proposed estimator (Section EJ) and a computationally effhcient 
algorithm to solve Problem (J6]) (Section [4]). 


3 Oracle inequalities for the DAG estimation 


The main result of this section deals with convergence rates: in Theorem [T] we provide 
upper bound for error associated with the G-penalized maximum likelihood estimator con¬ 
sidered in Equation (|6l) , both in prediction (Equ a tion [71) and estimation (Equation [8]) ■ 


Following the works of van de Geer fe Biihhnannl ( 2013 ) on the -^-penalized maximum 


likelihood estimator and of Bickel et ah f 2Q09h on the Lasso and the Dantzig Selector, we 


obtain two convergence results under some mild sparsity assumptions, when the number of 
variables is large but upper bounded by a function tp{n) of the sample size n. 


3.1 Estimating the true order of variables 


For a known ordering among the variables of the graph (jShojaie & Michailidiq, 2010J), an 
unrealistic assumption in many applications, the DAG inference problem can be cast in a 
convex optimization problem. To provide oracle inequalities of the proposed estimator in 
the most general case of an unknown order we consider here, we first focus on the problem 
of estimating the true variable order. Let us denote by n 0 the set of permutation matrices 
compatible with the true DAG Qq\ 


n 0 = {P G P p (M), P t GqP G T p (M)} . 

n 0 contains one or more permutation matrice(s) (see Example [T]). We will have to make a 
decision as to whether the estimated order of variables P given by Equation dHJ) is in ITo or 
not. 

To answer this question, we investigate the effect of learning an erroneous order of 
variables P n 0 . We introduce the following additional notations: for any permutation 
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matrix P 6 P p (M), we denote by Go(P) the matrix defined as: 


G 0 (P) = PT 0 P T , 


with To = PqGqPq a lower triangular decomposition of Go- From a graphical point of view, 
while P ^ n 0 , the graph Go(P) associated to Go(T’) is obtained from Go by permuting 
some of its nodes (see Example [2]), otherwise, if P 6 n 0 , Go(P) = Go- We also denote by 
s(P) := X — XGo(P) the associated residual term. We denote by Gl(P) the covariance 
matrix of e(P) and uJj(P) Var(e J ’(P)) the associated noise variances of each node. 

With these notations and checking that the assumptions presented in Section 13.21 hold, 
we ensure that, with large probability, we choose a right order of variables and the estimated 
graph converges to the true graph when n and p grow to infinity (see Section l3T3]h 


Example 2 Let 


P = 


/0 0 0 
0 0 0 
0 1 0 
1 0 0 
\0 0 1 


0 

1 

0 
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1\ 

0 

0 

0 

0 ) 


i n 0 


a wrong permutation. 

In Figured we represent the permuted graph Go{P ) (right) associated to the graph Go 
(left). The latter is obtained from Go after permutation of its nodes using PP(f, where P 0 
(corresponding to the matrix P in Example GJ] defines a right order of variables. 



Figure 2: The graph Go (top) and the permuted graph Go(P) (bottom) associated to the 
permutation P. 


3.2 Assumptions on the model 

For a square matrix M 6 At pX p(®) and a subset S of [l,p] 2 , we denote by Ms e At pxp (K) 
the matrix that has the same elements as M on S and zero on the complementary set S c 
of S. We now introduce the assumptions we used to obtain statistical properties of our 
estimator. 
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Hypotheses 

Hi There exists a 2 > 0 such that 


Vj G [l,p], Var(e J ) = a 2 . 

H 2 There exists <Jq, independent of p and n, such that 

max VarfX 7 ) < crn. 

i <j<p 

H 3 There exists A* > 0 such that the minimal eigenvalue of the covariance matrix £ of X 
satisfies 

A min > A* > 0. 

H 4 There exists g m ax < oo such that the maximal weight of the DAG Qq is bounded 

max I(G 0 )-I < gmax- 

H 5 The number of nodes p satisfies 


plogp = 0{n). 


H 6 There exists k(s) > 0 with 1 < s 2 < p 2 such that: 


min 


\\xm\\ f \ 
Vn\\M s \\ F f 


> k(s), 


where the minimum is taken over the set of p x p matrices satisfying < 3 

with S C [l,p ] 2 and |«S| < s. 

H 7 There exists 0 < p < x A such that, for all permutations P £ fl 0 , 



Assumption H x states that the noise variances are the same among all variables. This 
assumption is clearly hard to test in practice but makes the problem identifiable and ensures 
that we can recover the true DAG. Otherwise, minimizing ([5]) only leads to the identification 
of one element of the Markov equivalence class of the true DAG (partially directed graph). 
To simplify the theoretical results and proofs, until the end of this work, we assume that 
the noise variances a 2 are equal to one. Our results are still valid even if a 2 7 ^ 1, by small 
modifications in the constant terms as long as they are all equal. 

Assumption H 5 deserves a special attention since it bounds the high dimensional set¬ 
ting. The considered problem is obviously non-trivial and requires a sufficient amount of 
information. A more detailled discussion about assumptions H 3 and H 5 is proposed in 
Section 13.41 


Assumption H 6 is a natural extension of the Restricted Eigenvalue condition of Bickef et al 


(12009) to our multi-task setting. More precisely, denoting 








X = 


X 0 


0 X 


n x p, 


p 


H 6 is equivalent to assum i ng that the Gram matrix is non-degenerate on a restricted 
cone ( Lounici et ah . 2009 : Biihlmann &; van de Geer . 201 lh . Notice that this condition is 


very classical in the literature. It yields good practical performance even for small sample 
sizes, and some recent wor ks discuss an accurate population eigenva l ue estimation even in 
a larg e dimension setting f Mestrel . 2008; El Karoul 12008 : Liu et al. . 2014 : Ledoit fc Woll . 

201 fin . 

The last assumption H 7 is an identifiability condition needed to ensure that th e est i- 
mated permutation P is in n 0 . This assumption was introduced by van de G ee r fc Bii hlman n. 
( 20131 ) as the “omega-min” condition. In a sense, it separates the set of compatible per¬ 
mutations from its complement in a finite sample scenario. 


3.3 Main result 

The result we establish in this section is double-edged: (a) with large probability, we 
ensure that the estimated P belongs to n 0 , and (b) we provide oracle inequalities both in 
prediction and estimation for the graph estimated from the minimisation problem (j 6 j) . This 
result clearly states the desirable theoretical properties of the derived estimator, assuming 
reasonable conditions on the complex system embedding the data. 


Theorem 1 Assume that Hi_ 7 are satisfied, with s 2 C [l,p 2 ] in H 6 such that X^jl(G 0 )V 0 — 


s 2 (G 0 is s 2 -sparse). Let A = 2 G^s^p. Then, with probability greater than 1 — 5/p, any 

solution G = PTP T of the minimization problem m satisfies that P e Il 0 . Moreover, with 
at least the same probability, the following inequalities hold: 


1 

n 


XG - XG 0 


2 < 16 C,2 „3 lQ g P 
f ~ n 2 (s) n 


(7) 


G-G 0 


< 


! 6 G 5/2 /bjp 

k 2 (s) V n 


( 8 ) 


The proof of this result is deferred in Section C of the Supplementary Materials. 

Theorem Q] states that with probability at least 1 — 5 /p, we choose a compatible order of 
variables over the set of permutations. Inequalities (ED and (JBD give non-asymptotic upper 
bounds on the loss under conditions depending on s, p and n (see Section HTTP . They also 
ensure that the estimated T is close to the true To with large probability. 
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3.4 Discussion on the high-dimensional scenario 


Sparsity of the graph Assumption H 7 and Theorem Q] naturally require a trade-off be¬ 
tween signal sparsity, dimensionality and sample size. In the ultra sparse regime (where the 
sparsity s 2 of the true graph is bounded by s* >0), Theorem Q] provides oracle inequalities 
for G choosing r) < with plog(p) = cm in Assumption H 7 . 

In the standard sparsity scenario, if s 2 is at least of the order of p, then p should be 
of the order of a/^/p, which is unrealistic as p —> + 00 . This case thus requires a stronger 
dimensional assumption H 5 . Taking at least p 2 log(p) = 0(n ) ensures a good estimation 
of the graph. 

Note however that universal cond i tions cannot be overcome and the ultra-high dimen¬ 
sion settings (e.g. Wainwright (120091) ; Verzelen (2012)) is an insurmountable limit. 


Minimal eigenvalue condition Assumption H3 ensures that the minimal eigenvalue 
of the covariance matrix E of X is not too small. In the high-dimensional scenario, this 
could be hard to verify, X min decreasing while n,p growing to infinity ( Hogben . 2007 ). A 
natural bound for \ min is: 


^min 9 \ \ > (9) 

Pmax(l,^j (1 + s) 
with Qrnax and s as in H 4 and Theorem [0 

Assumption H 3 can thus be relaxed by allowing \ min to decrease with 1/ps. The 
price to pay for this relaxation is a data dimensionality reduction p 3 log(p) = 0(n), which 
automatically implies: 


3A mi „_ 2i /log(p)_ 3(T()i /2plog(p) >0i 


n 


n 


with Equation P) (for more details, see Section A of the Supplementary Material Proof 
details ). 


4 Inference algorithm 

4.1 Global algorithm overview 

In this section, we present GADAG (Genetic Algorithm for learning DAG ) a computational 
procedure devoted to solve Equation ([6]). Although decomposing the original problem made 
it more natural to handle, this problem is still a very challenging task from an optimization 
point of view, due to the different nature of the variables P and T, the non-convexity of 
the cost function and the high dimension of the search space. 

An intuitive approach consists in using an alternating scheme: one of the variables P 
or T is fixed and the other one is sought so as to optimize the score function, then the 
roles of P and T a re reversed a nd the procedure is repeated iteratively until convergence 
for some criterion ( Csiszar &; Tusnadvl . 19841) . However, the structure of our problem does 
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not allow us to use such a scheme: looking for an optimal T given a fixed P makes sense, 
but changing P for a fixed T does not. 

In our inference algorithm GADAG, an outer loop is used to perform the global search 
among the DAGs space, which is driven by the choice of P, while a nested loop is used 
to find an optimal T for each given fixed P (see Figure [3]). As we show in the following, 
population-based metaheuristics algorithms are a natural and efficient choice for exploring 
the space of permutation matrices (Section 14. 3p . The nested optimization problem can be 
resolved using a steepest descent approach (Section 14.21) . 



END 


Figure 3: Overview of our hybrid algorithm GADAG. 


4.2 Graph structure learning when the variable order is fixed 


Assume first that the variable ordering P G P p (R) is fixed. The problem of inferring a 
graph is then reduced to estimating the graph structure, which can be solved by finding a 
solution of: 

1 'pa-PTP^IIl + Aliriid. (10) 


nun 

TeTp(R 


n 


The minimization problem given by Equation (fTUD looks like a well-studied problem in 
machine learning, as it is closely relat ed to t he G -const rained quadratic program, known as 
the Lasso in the statistics literature (Ti bsh irani. 1996). Indeed, the i \-regularization leads 
to variable selection and convex constraints that make the optimization problem solvable. 
We note here that this allows us to always provide a locally optimal solution, i.e optimal 
weight estimates given a hierarchy between the nodes. 

A large number of efficient algorithms are available for computing the entire path of 
solutions as A is varied, e.g. the LARS algorithm of Efron et ah (120041) and its alter¬ 
natives. For example, in the cont e xt of the estimation of sparse undirected graphical 
models, Meinshausen fe Biihlma/nn ( 2Q06h fit a Lasso model to each variable, using the 
others as predictors, and define some rules for model symm etrization as they do not work 
on DAGs. The graphical Lasso (or glasso, Friedman et al. 2007) algorithm directly relies 
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on the estimation of the inverse of a structure cova riance matrix assumed to be sparse. 
Improveme nts were p roposed for example by Duchi et al. ( 20081 ) (improved stopping cri¬ 
terion) and Witten et ah f 201 lh (estimation of a block-diagonal matrix). Other authors 
propose to solve the optimization prob lem using an adaptation of classical optimization 
methods, such as interior point (Yuan fc Linl. 20071) or block coordinate descent methods 


flBaneriee et all 12008c iFriedman et al.l. 120071) 


We propose here an original convex optimization algorithm to find the solution in Equa¬ 
tion (HOD in a form similar to a steepest descent algorithm. Our proposed algorithm is much 
quicker than a glasso approach, a desirable feature as it will run at each iteration of the 
global algorithm (see the “Search of an optimal T* n box in Figure |3] and the “Evaluate the 
new individuals” item in Algorithm [2]). Moreover, its mechanistic components (see Section 
B of the Supplementary Material Proof details ) allowed us to derive the theoretical results 
of Theorem [Q The proposed scheme can be seen as an adaptation of the LARS algorithm 
with matrix arguments. 

Let (Tk)k >o the sequence of matrices defined for all i,j £ ([ 1 , p ] 2 


as: 


{Tk+i)i = sign ((U k )l) max ( 0, | (U, 


( 11 ) 


V( — IIA'(/—PTj.P t )|| 2 ) 

where for all k > 0,t4 = T k -—- j - p is the Lipschitz constant of the 

gradient function V ^ ||W(J — PT k P T )^ 2 p S j and sign() is the sign of any element. Then, a 
solution of (ROD is given by performing Algorithm [TJ where: 

• the projection Proj T ( R )(T) of any p x p real-valued matrix T = (('Tfe)Oij on the set 
T p (M) is given by 

fprowrARl 0 f ' u<3 ' 

V p 1 (Tfc)i otherwise. 


( 12 ) 


the gradient of 1 ||X(/ — PT k P q 


V [-\\X{I-PT k P 
' n 11 


F 

,T\ II 2 


IS 


=- (XP) T (X - XPT k P T )P. 


n 


(13) 


The detailed calculations are deferred to Section B of the Supplementary Material Proof 
details. 


4.3 A Genetic Algorithm for a global exploration of the permu¬ 
tation matrices space incorporating network topologies 

As the optimal T can be calculated for any P using Algorithm [T| and with a very good 
approximation accuracy according to Theorem [Q the optimization task ([ 6 ]) comes down to 
exploring the P p (M) space of permutation matrices in dimension p and to evaluating the 
quality of permutation candidates P £ P p (M). We first note that the number of permutation 
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Algorithm 1: Graph structure learning - minimization of (flUl) 

Input: A, L, e > 0. 

Initialization: To the null squared p x p matrix, k — 0 and e = +oo. 

while e > e do 

vf— IIX(7—PTfeP T )|| 2 ) __ 

Compute U k = T k -—- F -— with Equation (TT3]h 

Using Equation (fTTTh compute the current matrix T k+1 = ((T fc+1 )^)^.; 
Project T k+1 on T P (R) with Equation (H2J) : Tk+i Proj Tp(R) (T fe+1 ); 
Compute e = ||T fc+1 - T k \\ F ; 

Increase k: k <— k + 1; 
end 

Output: T k G T p (R) the unique solution of (fTOlh 


matrices is p!, which rules out any exact enumeration method, even for relatively small p. 
We propose instead to use a meta-heuristic approach, which has proven to be successful for 
many discrete opti mization prob l ems like wire-rout i ng, tr ansportation problems or traveling 
salesman problem f Michalewicz . 1994 : Dreo et ahl. 2006h . 

Among the different meta-heuristics (Simulated annealing, Tabu search, Ant Colony,...) 
we focused on Genetic Algorithms (GA) because, despite limited convergence results ( Cerl 


19981: iMichalewiczl . 119941) , they were found much more efficient i n problems related to o urs 


than alternatives with more established convergence proofs (e.g. Granville et al. (119941) for 
simulated annealing), while allowing the use of parallel computation. 

GAs mimic the process of natural evolution, and use a vocabulary derived from natural 
genetics: populations (a set of potential solutions of the optimization problem), individuals 
(a particular solution) and genes (the components of a potential solution). Each genera¬ 
tion/iteration of the algorithm will improve the constituting elements of the population. 
In short, a population made of N potential solutions of the optimization problem samples 
the search space. This population is sequentially modified, with the aim of achieving a 
balance between exploiting the best solutions and exploring the search space, until some 
termination condition is met. 

We use here a classical Genetic Algorithm (GA), as described in Michalewicz (119941) for 
instance, which is based on three main operators at each iteration: selection, crossover and 
mutation. The population is reduced by selection; selection shrinks the population diversity 
based on the individual fitness values. The crossover allows the mixing of good properties 
of the population to create new composite individuals. Mutations change one (or a few in 
more general GAs) components of the individuals to allow random space exploration. The 
complete sketch of algorithm GADAG is given in Algorithm [ 2 j A discussion on parameters 
to set in Algorithm [2] is found in Section 15.11 The details of the different operators are 
given in the following. 

As we show in Example |H1 any P G P p (R) is uniquely defined by a permutation vector 
of [l,p]. Hence, we use as a the search space & p the set of permutations of [l,p], which is 
a well-suited formulation for GAs. 
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Example 3 Consider the permutation matrix (p = 5 ): 


P = 


/O 0 0 1 0\ 

OOOOl 
0 10 0 0 

0 0 10 0 

\1 0 0 0 0 / 


Then, P is represented by the 


5\3\4\1\2\ 


vector, looking at the ranks of non-null values 


of P column by column. The nodes are ranked according to their topological ordering. 

Note that our problem closely resembles the classical Traveling Salesma n Pr oblem 


(TSP), which has been succesfully addressed by means of genetic algorithms (iGrefenstette et al 


19851 ; D avis. 199 1). Identically to the TSP, we optimize over the space of permutations, 
which induces specific constraints for defining the crossover and mutation operators. How¬ 
ever, unlike the TSP, the problem is not circular (in the TSP, the last city is connected 
to the first one), and the permutation here defines a hierarchy between nodes rather than 
a path, which makes the use of TSP-designed operators a potentially poor solution. As 
we show in the following, we carefully chose these operators in order to respect the nature 
of the problem at hand. In particular, we emphasize two of their desirable features: their 
efficiency in exploring the search space and the interpretable aspect they offer in terms of 
modifications on a given network or the blend of two different networks (crossover). 


Fitness function Given a potential solution p^ E & p , the fitness function is defined as: 

Ji = J(Pi) = - IPU - POtPl )||i + A 117711,. (14) 

with Pi constructed from pi as in Example [3] and T* the solution of Equation (fT0|) with 
P = Pi. As mentioned earlier, at each step of the proposed GA, the evaluation of the 
fitness function thus requires running the nested loop of our global algorithm GADAG. 


Selection operator The selection operator (or survival step) consists in generating a 
population of N individuals from the N existing individuals by random sampling (with 
replacement, hence some individuals are duplicated and others are deleted). It aims at 
improving the average quality of the population by giving to the best potential solutions 
a higher probability to be copied in the intermediate population. We have chosen to use 
the classical proportional selection of Holland! f 1975f) : each individuals is selected with a 
probability inversely proportional to its fitness value of Equation (TIT)) . 


Crossover operator A crossover operator generates a new set of potential solutions 
(children) from existing solutions (parents). Crossover aims at achieving at the same time 
(i) a good exploration of the search space by mixing the characteristics of the parents to 
create potentially new ones while (ii) preserving some of the desirable characteristics of the 
parents. By desirable features, we mean features of the network which lead to good fitness 


14 


















values, and which in turn are favored by selection over the generations. The crossover 
population (set of parents) is obtained by selecting each individual of the population with 
a probability p xo ; the parents are then paired randomly. 

We have chosen the order-based crossover, originally proposed for the TSP flMichalewic zl. 
1994 Chapter 10), which is defined as follows. Given two parents p\ and P 2 , a random 
set of crossover points are selected, which we denote hi. It consists in a permutation of 
k elements taken from [[1, p], with k uniformly drawn between 0 and p. A first child C i 
between pi and p 2 is then generated by: 


1 . fixing the crossover points of p \, 

2 . completing C\ with the missing numbers in the order they appear in p 2 . 


Example 4 Consider the two following parents: 


4 

3 

10 

7 

5 

9 

1 

2 

6 

8 


6 

1 

9 

4 

10 

2 

8 

3 

7 

5 


Assume that the crossover points randomly chosen are 4, 9, 2 and 8 (in bold red above). 
Then, the child Ci is defined by inheriting those points from pi and filling the other points 
in the order they appear in p 2 '- 


P2 

6 

l 

0 

/ 

10 

4 

0 

3 

7 

5 


\\ 

Ci 

4 

* 

* 

* 

* 

9 

* 

2 

* 

8 


4 


4 

6 

1 

10 

3 

9 

7 

2 

5 

8 


From a graphical point of view, a crossover between pi and P 2 , which encode two 
complete graphs Qp x and Qp 2 , constructs two new graphs. One of them, Qc x is composed 
of the sub-graph of Qp 1 induced by the set of crossover points fl and the sub-graph of Qp 1 
induced by the complementary set Tl c of 0 in [l,p] (see Figure [4j). The second child graph 
Qc 2 is obtained in an identical manner by reversing the roles played by the two parental 
graphs. 

Mutation operator Mutation operators usually correspond to the smallest possible 
change in an individual (unary operator). We thus define it as an alteration of two neigh¬ 
bouring genes (see Example [5]). Graphically, a mutation consists in switching the arrowhead 
of an edge between two nodes. Mutation is applied to each child with probability p m . 

Example 5 A possible mutation for the first child of Example [4] is to swap the genes “1 ” 
and “10” (in bold red below): 


4 

6 

1 

10 

3 

9 

7 

2 

5 

8 


FT 
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Figure 4: Graphical representation of crossover between two 10-node graphs. The two 
parental graphs are represented on the left. The third graph, on the right, is obtained by 
combining the blue and red part of its parents using the crossover operator. 


Stopping criterion Two quantities are monitored along the iterations: the heterogeneity 
of the population and the value of the objective function. 

For the first indicator, we use the Shannon entropy, defined for each rank position 
3 e [ 1 ,P] as: 



where Afj is the number of times when i appears in position j. Hj = 0 if all the individuals 
“agree” on the position of a node and the population is perfectly homogeneous at this 
node. On the contrary, it is maximum when we observe a uniform distribution of the 
different nodes at a given position and the population is in this case at a maximum of 
heterogeneity or disorder for this position. The algorithm stops if the population entropy 
value H = j Hj drops below a threshold since H = 0 if all the individuals are identical. 
A second criterion can terminate GADAG if difference in the average fitness (denoted J 
thereafter) of the population between a given number of consecutive iterations, does not 
change by more than a predefined threshold. 

5 Numerical experiments 

This section is dedicated to experimental studies to assess practical performances of our 
method. We demonstrate the ability of our algorithm to analyse data sets, which mimic 
the activity of a complex biological system, and we compare it to other state-of-the art 
methods. In Section 15.11 we present the calibration of the Genetic Algorithm parameters. 
The competing methods are presented in Section 15.21 and Section 15.31 introduces the mea¬ 
sures we used to assess the merits of the different methods. Experimental results are then 
detailed in Section 15.41 
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Algorithm 2: GAD AG overview 
Input. PxoiPmi £H 0; 0; k max 0, imax 0; ^■ L. 

Initialization: Generate the initial population P (l with A permutations of [l,p], 
k = 0 and ej = +oo. 

while H > en & ej > ej& k < k max do 

Generate Vk+i as a random selection of N individuals from Vk\ 

Pick an even subset V xo of Vk+i (each individual of Vk+i selected with 
probability p xo ); 

Perform crossover on V xo by randomly pairing the individuals; 

Mutate each obtained individual with probability p m ; 

Evaluate the new individuals V rn by running Algorithm [T] 

Replace V xo by V m in V k +i ; 

Compute the Shannon entropy H and the difference in the average fitness 
ej = max 0 <i< jra „ {J(V k+ 1 ) - J(P fc _i)); 

Increase k: k k + 1; 

end 


5.1 Algorithm parameters 


Running the procedure of Algorithm [2] requires to define parameters of the outer loop, 
which generates our population of P’s, and of the nested loop to find the optimal T*. The 
evaluation of the Lipschitz gradient constant L, used to find the optimal graph structure T*, 
is known as a hard established problem in op timization. Some authors propose to choose a n 
estimate of L from a set of possible v alues (Jones et al. . 19931: Sergevev fe Kvasov . 2006 1. 
to estimate local L i pschi t z constants (ISergeyevl.ll995h . or to set it a priori to a fixed value 
( Evtushenko et all 2009 : Horst fc Pardalos . 1995 1. Here, observing Equation (|T3ll . a major 
bound for L is given by: 

2 ,, t ,, 

L<- \X T X 
n 

We found that setting L to this bound worked well in practice in all our scenarios. 

Five parameters need to be tuned to run the Genetic Algorithm: the crossover rate 
p xo , the mutation rate p m , the constant of the stopping criteria e# and ej and the size 
of the population N. For the first four parameters, we observed that their value had a 
limited effect on the efficiency, hence we chose commonly used values in the literature (see 
Table [1]). The size of the populati on has a more complex effect and has been investigated 


complex eliect and nas been investigated . 

_ 19891: Alanderl 19921 : Piszcz fe Soule 20061 : 

Rideel 2007 1 but without providing a definitive answer to the problem. In our simulation 


in several p rospective papers (e.g. ISchaffer et al 


study, we chose as a rule-of thumb N = 5 p, which was found as a good compromise between 
computational cost and space exploration on several experiments. 

The complete parameter settings used in our experiments are reported in Table |T] 
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Table 1: Algorithm parameter settings 


Parameter 

Value 

Pxo 

0.25 

Pm 

0.5 

N 

5 x p 

L 

l n\\X T X\\ F 

max. nb. of eval. 

5 x p 


10" 6 

ej 

lO” 4 


5.2 Comparison to state-of-the art 

We compare GADA G to fiv e stat e-of-the-art inference methods. Among them, the Genie3 
method ( Hnvnh-Thu et al. . 2010l) . based on random f orests, was the best p erformer of one 
of the DREAM4 sub-challenges, while the BootLasso ( Allouche et all 2013 1 was one of the 
key co mponents of the best performing approach of one of the DREAM5 sub-challenges 
( Allouche et al. . 2013 ). The two methods decompose the prediction of the network into p 
feature selection sub-problems. In each of th e y su b-problems, one of the node is predicted 
from the other ones using random forests (Brciman, 200 ill for Genie3 or a bootstrapped 
version of the Lasso ( Bach . 20081 ) for BootLasso. For the random forest approach, parents 
of each node were detected as most significant explanatory variables according to a vari¬ 
ance reduction criterion in a regression tree framework. The process was repeated on a 
randomized set of trees, which made up the so-called random forest. This method allowed 
us to derive a ranking of the importance of all variables for the t arget b y ave raging the 
scores over all the trees. We used the R package randomf orest (ILiaw fe Wienerl . 120021) 
for our results. The La sso is a h -no rm penalization technique for solving linear regression. 
Following the works of Bach ( 20081) . BootLasso uses bootstrapped estimates of the active 
regression set based on a Lasso penalty: only those variables that are selected in every 
bootstrap are kept in the model. In both cases, actual coefficient values are estimated from 
a straightforward least square procedure. Note that we slightly relax the condition for a 
variable to be included in the model, a variable was selected at a given penalty level if more 
than 80% of bootstrapped samples led to selecting it in the model ( Allouche et all 2013 ). 

We also compare our algorithm to three c lassic al methods for Bayesian Networks (BNs) 
modelling. BNs are graphical models (Pear], 20091) defined by a DAG and parameters that 
set quantitative relationships between variables. Algorithms devoted to structure and pa¬ 
rameter learning in BNs either aim at maximising a score that reflects the fit of the data 
to the learnt structure, or test for independencies between variables. They are often use d 


as references in a gene regulatory network inference context (iTsamardinos et all 2006j), 


athough mainly for moderate size networks. The first compared algorithm we used is the 
PC-algorithm ( Spirtes et all 2000l) . a popular constraint-based method that drastically 
reduces the number of conditional independence tests. It first builds the skeleton of the 
graph by removing edges from a complete undirected graph before determining the orienta- 
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tion of the edges, when possible. A large number of implementations of the PC-algorithm 
exists. The numerical results presented here were obtained using the pcAlg function of 
the R-package pcalg, based on stand ard co rrelation estimates for conditional indepence 
testing. We also ran ARACNE ( Margolin et all 2006 b an improved version of minimum- 
weight spanning tree that uses the information inequality to eliminate the majority of 
indirect relationships between variables. We used the ARACNE function of the R-package 
bnlearn. We finally compare GAD AG to the Greedy Equivalence Search (GSE) algorithm 
(Chickering, 2002h . implemented in the R-package pacalg, which heuristically searches in 
the space of equivalent classes the model with the highest Bayesian score. 


5.3 Performance metrics 

A classical performance measure for graph inference methods consists in comparing pre¬ 
dicted interactions with the known edges in the true graph Qq using precision versus recall 
(P/R) curves. We denote by TP, FP, FN and TN, the true positive (correctly predicted) 
edges, the false positive (inferred by mistake) edges, the false negative (missed) edges, and 
the true negative (correctly non-predicted) edges. The recall, defined as T p+ FN , measures 
the power (or sensitivity) of reconstruction of non-zero elements of the true matrix G (or 
equivalently of the true network) for one method, whereas the precision, equal to TPFFP , 
measures the accuracy of the reconstruction. The closer to one the precision and the recall 
the better. 

P/R curves represent the evolution of those quantities when varying the sparsity of the 
methods. Random forests produce a ranked list of regulatory interactions, which corre¬ 
sponds to the edges of the inferred graph. Edges are then successively introduced with 
decreasing confidence scores to produce the random forest P/R curve. GAD AG and Boot- 
Lasso are based on penalized optimization: both seek linear dependencies between the 
variables with a controlled level of parsimony (A in Equation (J5J) for GADAG). For A vary¬ 
ing from 0 (complete graph) to +oo (empty graph), each of these methods produce a list of 
edges, successively introduced in the model. These lists of edges define the precision versus 
recall curves for these two approaches. Note that the implementation we used for running 
ARACNE was only able to produce a final network prediction (interaction ranking is not 
available). As a summary performance measurement, we also computed the classical area 
under the P/R curve (AUPR). 

5.4 DREAM data analysis 

The datasets we used mimic activations and regulations that occur in gene regulatory 
networks, ft is provided by the DREAM4 challenge on “In Silico Network Challenge”. 
Note that although plausibly simulated, DREAM4 data sets are not real biological data 
sets. However, the used network structures (five in total) were extracted from E. coli and S. 
cerevisae -two biological model organisms- trancriptional networks. These networks contain 
cycles, but self-loops were discarded. The gene expression observations were not simulated 
by an equal noise Gaussian multivariate model, stochastic differential equations were used 
to mimic the kinetic laws of intricate and intertwined gene regulations. In addition to the 
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biological noise simulated from the stochastic differential equations, technical noises were 
added to reproduc e act ual ge ne me asurement noise. All data sets were generated by the 


GNW software (|Marbach et ah, 2009) 


Working with simulated networks, we are able to quantitatively and objectively assess 
the merit of competing methods in terms of true predictions (true positive TP and true 
negative TN) vs. incorrect predictions (false positive FP and false negative FN) edges. 
While the analysis of a real data set is certainly the final goal of a methodology motivated 
by a real problem like ours, there are only imprecise ways of validating a method when 
analysing a real data set. Well known systems are often small and even if knowledge has 
accumulated on them, these can be noisy and difficult to gather to obtain a fair picture 
of what can adequately be considered as sets of true positive and true negative sets of 
edges. Even if the data generation process of the DREAMT In Silico Network Challenge is 
completely understood, no existing method is able to predict all regulatory relationships, 
but at the price of including many false positive predictions. The DREAM4 datasets we 
considered have p = 100 nodes and only n = 100 observations making it a a very challenging 
task. 


5.5 Numerical results 

The P/R curves for the five DREAM problems are shown in Figure [5] Each curve corre¬ 
sponds to one of the five networks used in the challenge. In general, for all the problems 
the five methods are able to achieve a precision equal to one (that is, to include only true 
edges), but these correspond to overly sparse graphs (very small recall). Conversely, a 
recall equal to 1 can only be reached by adding a large number of FP edges, whatever the 
method we consider, even if some fail earlier than others. The main differences between 
the methods appear on the leftmost part of the P/R curves, especially those of Figure [5] 
B, C and D: while the precision of BootLasso, Genie3, PCalg and GSE drops rapidly with 
a slow increases in recall above 20% recall, it remains higher for GADAG. Hence, its first 
predicted edges are at least as accurate than those of the four other methods and it pro¬ 
duces a larger set of reliable edges. For graphs of lesser sparsity, none of the five methods 
is really able to identify clearly reliable edges. Large number of FP edges are produced to 
achieve a recall higher than 60%. 

For Networks 1 and to a lesser extent 5 (Figure 0 A and E), GADAG recovers with 
more difficulty the first true edges than other methods, with a high level of FP edges at 
the beginning of the curve (low precision and low recall). However, as soon as the recall 
exceeds the 10%, resp. 15%, for graph A, resp. for graph E, GADAG performance is again 
superior to that other methods. 

Table [2] gives the areas under the P/R curves for all methods and networks. For this 
indicator, GA significantly outperforms the state-of-the-art methods for all networks. 


6 Conclusion and discussion 

In this paper, we proposed a hybrid genetic/convex algorithm for inferring large graphs 
based on a particular decomposition of the -penalized maximum likelihood criterion. We 
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D 




Recall 


Recall 


E 



Recall 


Figure 5: P/R curves for the five Dream networks and the five compared methods. 


Method 

Net 1 

Net 2 

Net 3 

Net 4 

Net 5 

GADAG 

0.182 

0.236 

0.348 

0.317 

0.267 

Genie3 

0.154 

0.155 

0.231 

0.208 

0.197 

BootLasso 

0.118 

0.061 

0.171 

0.147 

0.169 

PCalg 

0.116 

0.089 

0.171 

0.149 

0.130 

GSE 

0.101 

0.089 

0.170 

0.153 

0.133 


Table 2: Area under the Precision vs. Recall curve for all networks and methods (except 
ARACNE). 


obtained two oracle inequalities that ensure that the graph estimator converges to the true 
graph under assumptions that mainly control the model structure: graph size (balance 
between sparsity, number of nodes and maximal degree) and signal-to-noise ratio. From 
an algorithmic point of view, the estimation task is split into two subproblems: node 
ordering estimation and graph structure learning. The first one is a non-trivial problem 
since we optimize over a discrete non-convex large dimensional set. It led us to use a 
heuristic approach we specifically tailored to achieve the optimization task. The second 
one is a more common problem, related to the Lasso one, for which we proposed a sound 
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procedure with theoretical guarantees. The potential of such an approach clearly appeared 
in the numerical experiments, for which the behavior of our algorithm seemed to be very 
competitive when compared to the state-of-the-art. 

Nevertheless, we see many opportunities for further improvements. First, convergence 
proof for the algorithm, although a challenging task, is worth investigating, for instance 
using the works of Cerfl ( 1998 1 on genetic algorithms. An alternative would be to consider 
other optimization schemes for the n ode or dering with more established convergence proofs 
(e.g. simulated annealing ( Granville et all 1994 1). 

Second, other potential extensions involve algorithmic considerations in order to im¬ 
prove the calculation time, including a finer calibration of the algorithm parameters, an 
initialization step for the gradient descent, and, in general, improving the interactions be¬ 
tween the nested and outer loops. Tackling very large datasets from several thousands of 
nodes may also require a particular treatment, for instance by adding local search operators 
to GADAG. 

Finally, we would like to emphasize the graph identihability problem: in our settings, we 
assume the noise variances of all graph nodes to be equal to ensure graph identihability (that 
is no equivalence class of graphs). Such a hypothesis is of course restrictive and likely to be 
violated for real datasets. In order to infer networks for any noise variances, one solution 
consists in incorporating interventional data on the model. These data are obtained from 
perturbations of the biological system ( e.g. ge ne kn ocko uts or over-expressions) and make 
the equivalence class of graphs smaller (lHauser fe Buhlniann . 2012 1. Then, the use of such 
ne w data could be combined wi th observational on the MLE estimator (as recently propo sed 
by Hauser fc Biihlma/nn ( 2015 1 for a BIC-score penalized MLE, or bv lRau et al. ( 2013 1 for 
learning Gaussian Bayesian networks in the case of GRN inference) and a modification of 
our hybrid algorithm could lead to the identification of the true graph. 
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